Conversation
Pull request overview
This PR adds initial support for the OpenAI Responses API to OVMS LLM serving, wiring a new /v3/responses endpoint through request parsing and unary response serialization, with accompanying unit tests.
Changes:
- Introduces a new `Endpoint::RESPONSES` and routes `/v3/responses` + `/v3/v1/responses` to it.
- Adds Responses-specific request parsing (`input`, `max_output_tokens`, conflict check with `max_completion_tokens`) and unary response JSON serialization (`object: "response"`, `output: [...]`).
- Extends unit tests to cover Responses parsing and unary response structure.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/test/http_openai_handler_test.cpp | Adds unit tests validating Responses parsing and unary response output structure. |
| src/llm/servable.cpp | Routes /v3/responses to the new endpoint and prepares inputs for it. |
| src/llm/apis/openai_completions.hpp | Adds Endpoint::RESPONSES and declares parseResponsesPart(). |
| src/llm/apis/openai_completions.cpp | Implements Responses parsing and a new unary serialization path for Responses output format. |
Comments suppressed due to low confidence (2)
src/llm/apis/openai_completions.cpp:826
- For RESPONSES requests, the validation error message says "max_tokens value should be greater than 0", but the endpoint-specific fields are `max_output_tokens`/`max_completion_tokens`. Updating the message to reference the actual accepted field(s) will make it clearer for API consumers.

```cpp
// specific part of max_tokens validation
if (request.maxTokens == 0) {
    return absl::InvalidArgumentError("max_tokens value should be greater than 0");
}
```
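To make the message endpoint-aware, a minimal sketch (the enum value exists in this PR; the helper name and exact wording here are hypothetical):

```cpp
#include <string>

enum class Endpoint { CHAT_COMPLETIONS,
    COMPLETIONS,
    RESPONSES };

// Hypothetical helper: report the field name the endpoint actually accepts,
// so Responses API consumers are not pointed at "max_tokens".
std::string maxTokensZeroError(Endpoint endpoint) {
    const char* field = (endpoint == Endpoint::RESPONSES)
                            ? "max_output_tokens / max_completion_tokens"
                            : "max_tokens";
    return std::string(field) + " value should be greater than 0";
}
```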
src/llm/servable.cpp:76
- The invalid-endpoint error message lists `/v3/tokenize`, but `TokenizeParser::isTokenizeEndpoint()` accepts any URI that ends with `tokenize` (including `/v3/v1/tokenize`). Consider listing both `/v3/tokenize` and `/v3/v1/tokenize` (consistent with the other endpoints) or wording it as a suffix-based rule to avoid misleading users.

```cpp
return absl::InvalidArgumentError("Wrong endpoint. Allowed endpoints: /v3/chat/completions, /v3/completions, /v3/responses, /v3/tokenize");
```
src/llm/apis/openai_completions.cpp
Outdated
```cpp
std::optional<uint32_t> maxCompletionTokens;
std::optional<uint32_t> maxOutputTokens;

// max_completion_tokens: uint; optional
it = doc.FindMember("max_completion_tokens");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
    if (!it->value.IsUint()) {
        if (it->value.IsUint64())
            return absl::InvalidArgumentError("max_completion_tokens value can't be greater than 4294967295");
        return absl::InvalidArgumentError("max_completion_tokens is not an unsigned integer");
    }
    if (maxTokensLimit.has_value() && it->value.GetUint() > maxTokensLimit.value())
        return absl::InvalidArgumentError(absl::StrCat("max_completion_tokens exceeds limit provided in graph config: ", maxTokensLimit.value()));
    maxCompletionTokens = it->value.GetUint();
}

// max_output_tokens: uint; optional
// OpenAI Responses API uses this field for output token limit.
it = doc.FindMember("max_output_tokens");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
    if (!it->value.IsUint()) {
        if (it->value.IsUint64())
            return absl::InvalidArgumentError("max_output_tokens value can't be greater than 4294967295");
        return absl::InvalidArgumentError("max_output_tokens is not an unsigned integer");
    }
    if (maxTokensLimit.has_value() && it->value.GetUint() > maxTokensLimit.value())
        return absl::InvalidArgumentError(absl::StrCat("max_output_tokens exceeds limit provided in graph config: ", maxTokensLimit.value()));
    maxOutputTokens = it->value.GetUint();
}

if (maxCompletionTokens.has_value() && maxOutputTokens.has_value() && maxCompletionTokens.value() != maxOutputTokens.value()) {
    return absl::InvalidArgumentError("max_output_tokens and max_completion_tokens must match when both are provided");
}
if (maxOutputTokens.has_value()) {
    request.maxTokens = maxOutputTokens.value();
} else if (maxCompletionTokens.has_value()) {
    request.maxTokens = maxCompletionTokens.value();
}
```
OpenAIChatCompletionsRequest::maxTokens is std::optional<int>, but in RESPONSES parsing you accept max_output_tokens/max_completion_tokens as uint32_t and then assign into request.maxTokens. Values > INT_MAX will overflow / become implementation-defined. Consider changing maxTokens to an unsigned/wider type (e.g., uint32_t/uint64_t) or clamp + validate against std::numeric_limits<int>::max() before assigning.
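A hedged sketch of the clamp-and-validate option (the field names mirror those discussed above; the helper itself is hypothetical, not OVMS code):

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical guard: only assign a parsed uint32_t token limit into a
// std::optional<int> field when it fits; otherwise signal the caller to
// return an InvalidArgumentError instead of silently overflowing.
bool assignMaxTokens(uint32_t parsed, std::optional<int>& maxTokens) {
    if (parsed > static_cast<uint32_t>(std::numeric_limits<int>::max()))
        return false;
    maxTokens = static_cast<int>(parsed);
    return true;
}
```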
```cpp
if (request.maxTokens.has_value()) {
    writer.String("max_output_tokens");
    writer.Int(request.maxTokens.value());
}

writer.String("output");
writer.StartArray();
int outputIndex = 0;
for (const auto& parsedOutput : parsedOutputs) {
    const std::string outputId = "msg-" + std::to_string(outputIndex++);

    writer.StartObject();
    writer.String("id");
    writer.String(outputId.c_str());
    writer.String("type");
    writer.String("message");
    writer.String("role");
    writer.String("assistant");
    writer.String("status");
    writer.String("completed");
    writer.String("content");
    writer.StartArray();
    writer.StartObject();
    writer.String("type");
    writer.String("output_text");
    writer.String("text");
    writer.String(parsedOutput.content.c_str());
    writer.String("annotations");
    writer.StartArray();
    writer.EndArray();
    writer.EndObject();
    writer.EndArray();
    writer.EndObject();
}
writer.EndArray();

writer.String("usage");
writer.StartObject();
writer.String("input_tokens");
writer.Int(usage.promptTokens);
writer.String("input_tokens_details");
writer.StartObject();
writer.String("cached_tokens");
writer.Int(0);
writer.EndObject();
writer.String("output_tokens");
writer.Int(usage.completionTokens);
writer.String("output_tokens_details");
writer.StartObject();
writer.String("reasoning_tokens");
writer.Int(0);
writer.EndObject();
writer.String("total_tokens");
writer.Int(usage.calculateTotalTokens());
writer.EndObject();
```
serializeResponsesUnaryResponse() writes numeric fields with writer.Int(...) (e.g., max_output_tokens, input_tokens, output_tokens, total_tokens). Several of these values come from size_t or may exceed signed 32-bit range, which can overflow and produce incorrect JSON. Prefer Int64/Uint64 (or Uint) for token counters/limits to match the validated range and the underlying types.
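A small sketch of the range check that would be needed before each `writer.Int(...)` call (the helper is hypothetical; using `writer.Uint64(...)` for token counters avoids the problem entirely):

```cpp
#include <cstdint>
#include <limits>

// Demonstration of the narrowing hazard: a token counter just above INT_MAX
// no longer round-trips through int (the conversion wraps on mainstream
// compilers, and is defined as modular since C++20).
bool fitsInInt(uint64_t tokens) {
    return tokens <= static_cast<uint64_t>(std::numeric_limits<int>::max());
}
```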
@michalkulakowski can you protect against such corner cases?
src/llm/apis/openai_completions.cpp
Outdated
```cpp
const auto createdAt = std::chrono::duration_cast<std::chrono::seconds>(created.time_since_epoch()).count();
const std::string responseId = "resp-" + std::to_string(createdAt);
```
responseId is derived only from createdAt in seconds ("resp-" + std::to_string(createdAt)), which can easily collide under concurrent load (multiple requests within the same second). Use a higher-resolution timestamp and/or add a per-process counter/UUID component to keep response IDs unique.
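One possible shape for a collision-resistant ID (purely a sketch; the microsecond resolution and the per-process counter are both assumptions, not what the PR currently does):

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <string>

// Sketch: microsecond timestamp plus a per-process atomic counter, so two
// requests arriving within the same second still receive distinct IDs.
std::string makeResponseId() {
    static std::atomic<uint64_t> counter{0};
    const auto micros = std::chrono::duration_cast<std::chrono::microseconds>(
        std::chrono::system_clock::now().time_since_epoch())
                            .count();
    return "resp-" + std::to_string(micros) + "-" + std::to_string(counter.fetch_add(1));
}
```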
```md
:::{dropdown} **Unary call with curl using responses endpoint**
**Note**: using urls in request requires `--allowed_media_domains` parameter described [here](../../../docs/parameters.md)
```

Suggested change:

```md
:::{dropdown} **Unary call with cURL using Responses API**
**Note**: Using URLs in the request requires the `--allowed_media_domains` parameter described [here](../../../docs/parameters.md).
```
**Note**: using urls in request requires `--allowed_media_domains` parameter described [here](../../../docs/parameters.md)

```bash
curl http://localhost:8000/v3/responses -H "Content-Type: application/json" -d "{ \"model\": \"OpenGVLab/InternVL2-2B\", \"input\":[{\"role\": \"user\", \"content\": [{\"type\": \"input_text\", \"text\": \"Describe what is on the picture.\"},{\"type\": \"input_image\", \"image_url\": \"http://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/3/demos/common/static/images/zebra.jpeg\"}]}], \"max_output_tokens\": 100}"
```
Can you break down the command into multiple lines to showcase what is actually being sent?
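For reference, the payload that command sends is (pretty-printed, same content):

```json
{
  "model": "OpenGVLab/InternVL2-2B",
  "input": [
    {
      "role": "user",
      "content": [
        { "type": "input_text", "text": "Describe what is on the picture." },
        { "type": "input_image", "image_url": "http://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/3/demos/common/static/images/zebra.jpeg" }
      ]
    }
  ],
  "max_output_tokens": 100
}
```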
demos/continuous_batching/README.md
Outdated
```md
:::

### Unary calls to responses endpoint using cURL
```

Suggested change:

```md
### Unary calls via Responses API using cURL
```
```md
:::

:::{dropdown} **Streaming request with OpenAI client using responses endpoint**
```

Suggested change:

```md
:::{dropdown} **Streaming request with OpenAI client via Responses API**
```
```md
## Benchmarking text generation with high concurrency

OpenVINO Model Server employs efficient parallelization for text generation. It can also be used to generate text with high concurrency in environments shared by multiple clients.
```
Why is it being added in responses api PR? Is it related?
src/llm/apis/openai_completions.cpp
Outdated
```cpp
writer.String("input_tokens_details");
writer.StartObject();
writer.String("cached_tokens");
writer.Uint64(0);
```

Please add a note here that it is not supported.
src/llm/apis/openai_completions.cpp
Outdated
```cpp
writer.String("output_tokens_details");
writer.StartObject();
writer.String("reasoning_tokens");
writer.Uint64(0);
```

Can we support it in this release?
src/llm/apis/openai_completions.cpp
Outdated
```cpp
if (inputIt->value.IsString()) {
    request.prompt = inputIt->value.GetString();
    if (!request.prompt.has_value() || request.prompt.value().empty()) {
```

request.prompt must have a value if you set it one line earlier.
src/llm/apis/openai_completions.cpp
Outdated
```cpp
    }
    request.imageHistory.push_back({i, tensor});
} else {
    return absl::InvalidArgumentError("Unsupported content type");
```

Can you include what the content type was? That way we can easily see in the logs which unsupported content type the user sent.
src/llm/apis/openai_completions.cpp
Outdated
```cpp
if (existingMessages != doc.MemberEnd()) {
    existingMessages->value = messages;
} else {
    doc.AddMember("messages", messages, allocator);
```

Why is the document being edited? If I understand correctly, we parse the doc into our request format. Why do we need to edit the doc? Shouldn't it be left untouched, with the next steps working only on the request? Why do we parse it if we don't use it later?
src/llm/apis/openai_completions.cpp
Outdated
```cpp
auto& obj = it->value.GetArray()[i];
if (!obj.IsObject())
    return absl::InvalidArgumentError("Tool is not a JSON object");
rapidjson::Value* functionObj = nullptr;
```

functionObj is never used outside this scope, so you don't need to work with pointers here, contrary to parametersValue where it is required.
src/llm/apis/openai_completions.cpp
Outdated
```cpp
parametersValue->Accept(writer);
std::string parametersStr = buffer.GetString();
ToolSchemaWrapper schemaReprs{parametersValue, std::move(parametersStr)};
request.toolNameSchemaMap[functionNameCStr] = std::move(schemaReprs);
```

Suggested change:

```cpp
request.toolNameSchemaMap[functionName] = std::move(schemaReprs);
```
src/llm/apis/openai_completions.cpp
Outdated
```cpp
if (it->value.GetUint() == 0)
    return absl::InvalidArgumentError("n value should be greater than 0");
if (endpoint == Endpoint::RESPONSES && request.stream && it->value.GetUint() > 1)
    return absl::InvalidArgumentError("n greater than 1 is not supported for responses streaming");
```

Suggested change:

```cpp
    return absl::InvalidArgumentError("n greater than 1 is not supported for Responses API streaming");
```
src/llm/apis/openai_completions.cpp
Outdated
```cpp
if (endpoint == Endpoint::RESPONSES) {
    std::vector<ParsedOutput> parsedOutputs;
    for (const auto& tokens : results.tokens) {
        updateUsage(usage, tokens, request.echo);
```

We just recently switched to using PerfMetrics @mzegla. Why can't we use it here?
src/llm/apis/openai_completions.cpp
Outdated
```cpp
events.emplace_back(serializeResponsesEvent([this, &responseId, createdAt](Writer<StringBuffer>& writer) {
    writer.StartObject();
    writer.String("type");
    writer.String("response.created");
```

This might not be the right place for this event to be streamed; please verify against other serving solutions if the OpenAI spec is not specific enough.
mzegla left a comment

It's a big change, let's review it iteratively. Pushing comments only on the logic; did not review the tests just yet.
src/llm/apis/openai_completions.cpp
Outdated
```cpp
    throw std::invalid_argument("Unsupported JSON value type");
}

std::string serializeResponsesEvent(const std::function<void(Writer<StringBuffer>&)>& eventSerializer) {
```

How does it work? Does eventSerializer carry the data itself? It's not clear to me what the output string would be.
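An illustrative analog with std::ostream in place of rapidjson::Writer (this is not the OVMS code, just the callback pattern it appears to use): the caller's lambda carries the data in its captures and writes the payload; the helper only owns the buffer and returns the accumulated string.

```cpp
#include <functional>
#include <sstream>
#include <string>

// Analog of serializeResponsesEvent: the event payload lives entirely in the
// lambda; this wrapper just provides the buffer and collects the output.
std::string serializeEvent(const std::function<void(std::ostream&)>& eventSerializer) {
    std::ostringstream out;
    eventSerializer(out);  // lambda writes the payload into the buffer
    return out.str();
}
```

So `serializeEvent([&](std::ostream& o) { o << "{\"type\":\"response.created\"}"; })` would return exactly that JSON string.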
src/llm/apis/openai_completions.cpp
Outdated
```cpp
void serializeNotSupportedNullField(Writer<StringBuffer>& writer, const char* fieldName) {
    writer.String(fieldName);
    writer.Null();
}

void serializeNotSupportedZeroField(Writer<StringBuffer>& writer, const char* fieldName) {
    writer.String(fieldName);
    writer.Uint64(0);
}

void serializeNotSupportedEmptyArrayField(Writer<StringBuffer>& writer, const char* fieldName) {
    writer.String(fieldName);
    writer.StartArray();
    writer.EndArray();
}
```
src/llm/apis/openai_completions.cpp
Outdated
```cpp
const char* status, const std::string& fullOutputText, bool includeUsage,
const char* incompleteReason, const char* errorMessage, const char* errorCode) const {
```

Lots of parameters. Can we use std::strings and get const char* internally, so it's more straightforward to use?
src/llm/apis/openai_completions.hpp
Outdated
```cpp
std::string serializeStreamingUsageChunk();
std::string serializeStreamingHandshakeChunk();
std::string serializeResponsesStreamingInitEvents();
std::string serializeResponsesFailedEvent(const std::string& errorMessage, const char* errorCode = "server_error");
```

Error code as a string? Maybe we could have an enum here.
```cpp
if (executionContext->apiHandler && executionContext->apiHandler->isStream() && executionContext->endpoint == Endpoint::RESPONSES) {
    std::string failedEvent = executionContext->apiHandler->serializeResponsesFailedEvent(e.what());
    executionContext->response = wrapTextInServerSideEventMessage(failedEvent);
    executionContext->response += wrapTextInServerSideEventMessage("[DONE]");
    cc->Outputs().Tag(OUTPUT_TAG_NAME).Add(new std::string{std::move(executionContext->response)}, iterationBeginTimestamp);
    return absl::OkStatus();
}
return absl::InvalidArgumentError(e.what());
} catch (...) {
    if (executionContext->apiHandler && executionContext->apiHandler->isStream() && executionContext->endpoint == Endpoint::RESPONSES) {
        std::string failedEvent = executionContext->apiHandler->serializeResponsesFailedEvent("Response generation failed");
        executionContext->response = wrapTextInServerSideEventMessage(failedEvent);
        executionContext->response += wrapTextInServerSideEventMessage("[DONE]");
        cc->Outputs().Tag(OUTPUT_TAG_NAME).Add(new std::string{std::move(executionContext->response)}, iterationBeginTimestamp);
        return absl::OkStatus();
    }
```

Are you sure that it will be handled correctly with OK status? I mean, for example, will it exit properly?

Yes, in Responses API streaming, as opposed to chat/completions, error handling is realized through a special event rather than a status code.
```cpp
std::string initEvents = executionContext->apiHandler->serializeResponsesStreamingInitEvents();
if (!initEvents.empty()) {
    executionContext->response = wrapTextInServerSideEventMessage(initEvents);
    cc->Outputs().Tag(OUTPUT_TAG_NAME).Add(new std::string{std::move(executionContext->response)}, iterationBeginTimestamp);
    executionContext->response = "";
}
cc->Outputs().Tag(LOOPBACK_TAG_NAME).Add(new bool{true}, iterationBeginTimestamp);
```

Is that whole thing okay? You schedule and don't read/check output before returning. Aren't we at risk of skipping a chunk, especially if it's generated fast?

Yes, because here we emit Responses API events that don't contain data generated by the pipeline.
```cpp
    return false;
}

static py::object rapidJsonValueToPyObject(const rapidjson::Value& value) {
```

Is that whole thing necessary? Why don't we serialize to string -> pass to Python -> read to JSON?
```cpp
std::string OpenAIChatCompletionsHandler::serializeResponseCreatedEvent(const std::string& responseId, int64_t createdAt) {
    StringBuffer buffer;
    Writer<StringBuffer> writer(buffer);
    writeEventHeader(writer, "response.created");
    writer.String("response");
    serializeResponsesResponseObject(writer, responseId, createdAt, "in_progress", "", false);
    writer.EndObject();
    return buffer.GetString();
}

std::string OpenAIChatCompletionsHandler::serializeResponseInProgressEvent(const std::string& responseId, int64_t createdAt) {
    StringBuffer buffer;
    Writer<StringBuffer> writer(buffer);
    writeEventHeader(writer, "response.in_progress");
    writer.String("response");
    serializeResponsesResponseObject(writer, responseId, createdAt, "in_progress", "", false);
    writer.EndObject();
```

response.created / response.in_progress events serialize a full response object via serializeResponsesResponseObject(), which currently always emits an output array (and may include a placeholder message item). This conflicts with the subsequent emission of response.output_item.added / response.content_part.added events and can lead to duplicated/contradictory stream semantics. Consider making created/in_progress events omit output entirely (or emit an empty output array) and rely on the dedicated output_item/content_part events for initialization.
```cpp
// TODO: error not supported in unary response
writer.String("model");
writer.String(request.model.c_str());
writer.String("status");
writer.String(responseStatus.c_str());
```

serializeResponsesUnaryResponse() does not emit several spec-aligned fields that the new unit tests expect (e.g. "error":null, "previous_response_id":null, "reasoning":null, "user":null). This will cause the added tests (e.g. serializeUnaryResponseForResponses*) to fail and makes unary and streaming response objects structurally inconsistent. Reuse serializeResponsesResponseObject() for unary responses or explicitly add the missing null fields so the unary payload matches the Responses API schema.
```cpp
}

std::string OpenAIChatCompletionsHandler::serializeResponsesStreamingInitEvents() {
    const auto createdAt = std::chrono::duration_cast<std::chrono::microseconds>(created.time_since_epoch()).count();
```

Responses streaming uses createdAt in microseconds (duration_cast<microseconds>) and passes it to created_at in the response object. This makes created_at inconsistent with unary responses (seconds) and can even be larger than completed_at (which is seconds), breaking ordering and likely violating the OpenAI Responses API expectations. Use a consistent unit (typically epoch seconds) for created_at/IDs across unary and streaming, and ensure completed_at is in the same unit family.

Suggested change:

```cpp
    const auto createdAt = std::chrono::duration_cast<std::chrono::seconds>(created.time_since_epoch()).count();
```
```python
+    is_fc_model=True,
+    underscore_to_dot=True,
+),
+"ovms-model-stream-responses": ModelConfig(
```

Maybe ovms-model-responses-stream? More straightforward switching.
```cpp
}
#else
ov::genai::ChatHistory& chatHistory = executionContext->apiHandler->getChatHistory();
constexpr bool add_generation_prompt = true;
```

Suggested change:

```cpp
constexpr bool addGenerationPrompt = true;
```

?
src/llm/apis/openai_completions.cpp
Outdated
```cpp
    writer.String(toolCall.arguments.c_str());
    writer.EndObject();
}
if (!fullOutputText.empty() || responsesState.toolCalls.empty()) {
```

If we assume that when there were tool calls there is no other content, we should document it, for example as a comment here, to be explicit.
```cpp
    }
}

request.chatHistory.last()["content"] = contentText;
```
```cpp
if (input_ids.get_shape().size() != 2)
    throw std::runtime_error("input_ids should have 2 dimensions");
if (input_ids.get_shape()[0] != 1)
    throw std::runtime_error("input_ids should have 1 batch size");
if (input_ids.get_element_type() != ov::element::i64)
    throw std::runtime_error("input_ids should have i64 element type");
```

Does the user see those errors, or are they only in our DEBUG logging? The user does not see input_ids, so it's strictly debugging information.

I assume these exceptions should never happen, but we have the same check in the chat completions part a few lines below.
```cpp
int64_t* inputIdsData = reinterpret_cast<int64_t*>(input_ids.data());
std::vector<int64_t> generatedTokens(inputIdsData, inputIdsData + input_ids.get_shape()[1]);
parsedOutputs.push_back(parseOutputIfNeeded(generatedTokens));
```

Does parseOutputIfNeeded strictly require tokens? Is that why we need to do this entire operation here?
```cpp
const auto& toolCallsArr = deltaObj["tool_calls"];
for (rapidjson::SizeType i = 0; i < toolCallsArr.Size(); ++i) {
    const auto& tc = toolCallsArr[i];
    int tcIndex = tc.HasMember("index") ? tc["index"].GetInt() : 0;
```

Is it possible that for two iterations in the loop we get tcIndex = 0 here and overwrite?

I assume that it's not possible, since BaseOutputParser always includes an id in its wrapDelta methods.
src/llm/apis/openai_completions.cpp
Outdated
```cpp
writer.String("parallel_tool_calls");
writer.Bool(false);
```

We have support for parallel tool calls; why is it hardcoded to false here?
```cpp
if (!responsesState.reasoningText.empty()) {
    writer.StartObject();
    writer.String("id");
    writer.String("rs-0");
```

What is rs-0? Is it duplicated in multi-turn conversation, or is it for cases where the model mixes reasoning and content multiple times within one unary response?
```cpp
writer.StartObject();
writer.String("input_tokens");
writer.Uint64(static_cast<uint64_t>(usage.promptTokens));
// TODO: input_tokens_details.cached_tokens not supported
```

There is a table for the completions and chat completions APIs showing what we support and what we don't. Could we have the same for the Responses API? That way we will be able to talk about it clearly in the future, see what is missing, and estimate next tasks correctly. I'm just afraid people will assume something is done when in fact it's not.
```cpp
} else {
    return absl::InvalidArgumentError("input is not a string or array");
}
```

It's missing my bugfix: https://github.com/openvinotoolkit/model_server/pull/4079/changes
```cpp
auto kwargsIt = doc.FindMember("chat_template_kwargs");
if (kwargsIt == doc.MemberEnd()) {
    rapidjson::Value kwargs(rapidjson::kObjectType);
    kwargs.AddMember("enable_thinking", true, doc.GetAllocator());
```

Barely any model really uses this to enable thinking; Qwen3-VL for example doesn't. Is it an issue? @michalkulakowski @dtrawins
```cpp
std::stringstream ss;
ss << events.front();
for (size_t i = 1; i < events.size(); ++i) {
    ss << "\n\ndata: " << events[i];
```

There's already a util for that (wrapTextInServerSideEventMessage); why can't we reuse it here?
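Assuming wrapTextInServerSideEventMessage frames a single event as `data: <text>` followed by a blank line (an assumption; the real util may differ), the hand-written loop above could become a plain fold over it:

```cpp
#include <string>
#include <vector>

// Assumed behavior of the existing util: frame one SSE event.
std::string wrapTextInServerSideEventMessage(const std::string& text) {
    return "data: " + text + "\n\n";
}

// Joining events by reusing the util instead of hand-writing "\n\ndata: ".
std::string joinEvents(const std::vector<std::string>& events) {
    std::string out;
    for (const auto& event : events)
        out += wrapTextInServerSideEventMessage(event);
    return out;
}
```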